Working on the dataset of car accidents in India provides an opportunity to explore complex dynamics related to road safety in a specific context.
The diversity of variables, such as road conditions, driver characteristics, vehicle details, and accident causes, allows for a deeper understanding of contributing factors to accidents.
Analyzing these data can highlight key challenges in road safety and provide insights to guide targeted preventive initiatives. By understanding collision patterns, the profiles of at-risk drivers, and the predominant environmental conditions, we could, if this were a real project, contribute to better road safety policies and perhaps to fewer accidents and a safer road environment in India.
import pandas as pd
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly
import plotly.express as px
pd.set_option('display.max_columns', None)
plotly.offline.init_notebook_mode()
df = pd.read_csv('road.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12174 non-null  object
 14  Types_of_Junction            11429 non-null  object
 15  Road_surface_type            12144 non-null  object
 16  Road_surface_conditions      12316 non-null  object
 17  Light_conditions             12316 non-null  object
 18  Weather_conditions           12316 non-null  object
 19  Type_of_collision            12161 non-null  object
 20  Number_of_vehicles_involved  12316 non-null  int64
 21  Number_of_casualties         12316 non-null  int64
 22  Vehicle_movement             12008 non-null  object
 23  Casualty_class               12316 non-null  object
 24  Sex_of_casualty              12316 non-null  object
 25  Age_band_of_casualty         12316 non-null  object
 26  Casualty_severity            12316 non-null  object
 27  Work_of_casuality            9118 non-null   object
 28  Fitness_of_casuality         9681 non-null   object
 29  Pedestrian_movement          12316 non-null  object
 30  Cause_of_accident            12316 non-null  object
 31  Accident_severity            12316 non-null  object
dtypes: int64(2), object(30)
memory usage: 3.0+ MB
df.head()
| | Time | Day_of_week | Age_band_of_driver | Sex_of_driver | Educational_level | Vehicle_driver_relation | Driving_experience | Type_of_vehicle | Owner_of_vehicle | Service_year_of_vehicle | Defect_of_vehicle | Area_accident_occured | Lanes_or_Medians | Road_allignment | Types_of_Junction | Road_surface_type | Road_surface_conditions | Light_conditions | Weather_conditions | Type_of_collision | Number_of_vehicles_involved | Number_of_casualties | Vehicle_movement | Casualty_class | Sex_of_casualty | Age_band_of_casualty | Casualty_severity | Work_of_casuality | Fitness_of_casuality | Pedestrian_movement | Cause_of_accident | Accident_severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17:02:00 | Monday | 18-30 | Male | Above high school | Employee | 1-2yr | Automobile | Owner | Above 10yr | No defect | Residential areas | NaN | Tangent road with flat terrain | No junction | Asphalt roads | Dry | Daylight | Normal | Collision with roadside-parked vehicles | 2 | 2 | Going straight | na | na | na | na | NaN | NaN | Not a Pedestrian | Moving Backward | Slight Injury |
| 1 | 17:02:00 | Monday | 31-50 | Male | Junior high school | Employee | Above 10yr | Public (> 45 seats) | Owner | 5-10yrs | No defect | Office areas | Undivided Two way | Tangent road with flat terrain | No junction | Asphalt roads | Dry | Daylight | Normal | Vehicle with vehicle collision | 2 | 2 | Going straight | na | na | na | na | NaN | NaN | Not a Pedestrian | Overtaking | Slight Injury |
| 2 | 17:02:00 | Monday | 18-30 | Male | Junior high school | Employee | 1-2yr | Lorry (41?100Q) | Owner | NaN | No defect | Recreational areas | other | NaN | No junction | Asphalt roads | Dry | Daylight | Normal | Collision with roadside objects | 2 | 2 | Going straight | Driver or rider | Male | 31-50 | 3 | Driver | NaN | Not a Pedestrian | Changing lane to the left | Serious Injury |
| 3 | 1:06:00 | Sunday | 18-30 | Male | Junior high school | Employee | 5-10yr | Public (> 45 seats) | Governmental | NaN | No defect | Office areas | other | Tangent road with mild grade and flat terrain | Y Shape | Earth roads | Dry | Darkness - lights lit | Normal | Vehicle with vehicle collision | 2 | 2 | Going straight | Pedestrian | Female | 18-30 | 3 | Driver | Normal | Not a Pedestrian | Changing lane to the right | Slight Injury |
| 4 | 1:06:00 | Sunday | 18-30 | Male | Junior high school | Employee | 2-5yr | NaN | Owner | 5-10yrs | No defect | Industrial areas | other | Tangent road with flat terrain | Y Shape | Asphalt roads | Dry | Darkness - lights lit | Normal | Vehicle with vehicle collision | 2 | 2 | Going straight | na | na | na | na | NaN | NaN | Not a Pedestrian | Overtaking | Slight Injury |
We display the percentage of null values per column, then drop every row that contains at least one null value.
df.isna().sum() / len(df) * 100
df = df.dropna()
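Row-wise `dropna()` can remove far more data than the per-column percentages suggest, because a row is discarded as soon as any one of its columns is missing. In this notebook the grouped counts shown further down add up to 2,889 rows, against 12,316 in the raw file. The sketch below illustrates the effect on a small made-up frame; the rows and the `"Unknown"` fill label are assumptions for illustration, not values from `road.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for road.csv (column names borrowed from the dataset)
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Tuesday", "Sunday", "Friday"],
    "Educational_level": ["High school", None, "Junior high school", "High school"],
    "Defect_of_vehicle": ["No defect", "No defect", None, None],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Rows that survive a row-wise dropna(): only those with no missing value at all
rows_kept = len(toy.dropna())

# One possible alternative (an assumption, not the notebook's choice):
# keep every row and label the gaps explicitly
filled = toy.fillna("Unknown")

print(missing_pct.to_dict())
print(f"rows kept by dropna(): {rows_kept} of {len(toy)}")
```

Whether to drop or to fill depends on the analysis; filling keeps the sample size but adds an extra category to every chart.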
# Dash app: accidents per weekday, filterable by driver gender
app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id='accidents-by-day'),
    dcc.Checklist(
        id='gender-filter',
        options=[
            {'label': 'Male', 'value': 'Male'},
            {'label': 'Female', 'value': 'Female'},
        ],
        value=['Male', 'Female'],
        labelStyle={'display': 'block'}
    )
])
@app.callback(
    Output('accidents-by-day', 'figure'),
    [Input('gender-filter', 'value')]
)
def update_graph(selected_genders):
    # Keep only the rows whose driver gender is ticked in the checklist
    filtered_df = df[df['Sex_of_driver'].isin(selected_genders)]
    fig = px.histogram(filtered_df, x='Day_of_week', color='Day_of_week',
                       category_orders={"Day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]},
                       title='Accidents by Day of the Week',
                       labels={'Day_of_week': 'Day of the Week'})
    return fig
if __name__ == '__main__':
    app.run_server(debug=True)
fig2 = px.histogram(df, x='Educational_level', title='Distribution of accidents by educational level')
fig2.show()
fig3 = px.violin(df, x='Day_of_week', y='Age_band_of_driver', title='Age Distribution by Day of the Week')
fig3.show()
df2 = df.copy()
df2['nb_Accident'] = 1
df2 = df2.groupby(['Sex_of_driver', 'Age_band_of_driver']).count().reset_index()
df2
| | Sex_of_driver | Age_band_of_driver | Time | Day_of_week | Educational_level | Vehicle_driver_relation | Driving_experience | Type_of_vehicle | Owner_of_vehicle | Service_year_of_vehicle | Defect_of_vehicle | Area_accident_occured | Lanes_or_Medians | Road_allignment | Types_of_Junction | Road_surface_type | Road_surface_conditions | Light_conditions | Weather_conditions | Type_of_collision | Number_of_vehicles_involved | Number_of_casualties | Vehicle_movement | Casualty_class | Sex_of_casualty | Age_band_of_casualty | Casualty_severity | Work_of_casuality | Fitness_of_casuality | Pedestrian_movement | Cause_of_accident | Accident_severity | nb_Accident |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 18-30 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| 1 | Female | 31-50 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 2 | Female | Over 51 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 3 | Female | Under 18 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 4 | Female | Unknown | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 | 140 |
| 5 | Male | 18-30 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 | 987 |
| 6 | Male | 31-50 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 | 906 |
| 7 | Male | Over 51 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 |
| 8 | Male | Under 18 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 | 194 |
| 9 | Male | Unknown | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 | 249 |
| 10 | Unknown | 18-30 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| 11 | Unknown | 31-50 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| 12 | Unknown | Over 51 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| 13 | Unknown | Under 18 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 14 | Unknown | Unknown | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
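Every column in the grouped table above carries the same number, because `.count()` counts non-null values in each column separately. A more compact variant, sketched here on made-up rows, is `groupby(...).size()`, which returns one count per group. The toy data is an assumption; only the column names and the `nb_Accident` label come from the notebook.

```python
import pandas as pd

# Hypothetical rows; only the two grouping columns matter for the count
toy = pd.DataFrame({
    "Sex_of_driver": ["Male", "Male", "Female", "Male"],
    "Age_band_of_driver": ["18-30", "18-30", "31-50", "Over 51"],
    "Day_of_week": ["Monday", "Sunday", "Friday", "Monday"],
})

# size() yields a single count per (sex, age band) pair,
# so no helper column filled with 1 is needed
counts = (
    toy.groupby(["Sex_of_driver", "Age_band_of_driver"])
       .size()
       .reset_index(name="nb_Accident")
)

print(counts)
```

The resulting frame has three columns, so it can be passed to the plotting call in the same way as the wider table.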
fig4 = px.scatter_3d(df2, x='Age_band_of_driver', y='Sex_of_driver', z='nb_Accident',
                     title='Number of Accidents by Driver Age Band and Sex', color='nb_Accident')
fig4.show()
df3 = df.copy()
df3['nb_Accident'] = 1
df3 = df3.groupby(['Time']).count().reset_index()
# Smooth the counts with a 10-point rolling mean to make the plot more readable
df3['nb_Accident'] = df3['nb_Accident'].rolling(10).mean()
fig5 = px.line(df3, x='Time', y='nb_Accident', title='Number of Accidents Over Time')
fig5.show()
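At this point `Time` is still a column of strings, so grouping on it orders the groups lexicographically rather than chronologically, and the rolling mean is taken over that order. A minimal sketch of one way to get a chronological axis is to parse the strings with an explicit format and count per hour. The sample values are made up, and the assumption that every value matches `%H:%M:%S` is not verified against `road.csv`, which is why `errors="coerce"` is used.

```python
import pandas as pd

# Hypothetical sample of Time strings in the two shapes seen in df.head()
times = pd.Series(["17:02:00", "1:06:00", "9:30:00", "17:45:00"], name="Time")

# Sorting the raw strings is lexicographic, so "17:02:00" lands before "1:06:00"
string_order = sorted(times)

# An explicit format parses every value the same way; errors="coerce" turns
# anything that does not match into NaT instead of raising
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

# Count accidents per hour of day, in chronological order
accidents_per_hour = parsed.dt.hour.value_counts().sort_index()

print(string_order)
print(accidents_per_hour.to_dict())
```

Counting per hour also reduces the need for smoothing, since each point already aggregates a full hour of records.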
fig6 = px.pie(df, names='Light_conditions', color_discrete_sequence=px.colors.sequential.RdBu)
fig6.show()
fig7 = px.treemap(df, path=['Type_of_vehicle', 'Driving_experience'], title='Vehicle Type and Driving Experience')
fig7.show()
fig8 = px.sunburst(df, path=['Weather_conditions', 'Road_surface_type', 'Road_surface_conditions'], title='Weather and Road Conditions')
fig8.show()
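# Count accidents per junction type, stacked by collision type
fig9 = px.histogram(df, x='Types_of_Junction', color='Type_of_collision', title='Junction and Collision Types')
# Buttons toggle the ordering of the x-axis categories
fig9.update_layout(updatemenus=[dict(type='buttons',
                                     showactive=True,
                                     buttons=[dict(label='Sort by total', method='relayout', args=[{'xaxis.categoryorder': 'total descending'}]),
                                              dict(label='Original order', method='relayout', args=[{'xaxis.categoryorder': 'trace'}])])])
fig9.show()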
df['Time'] = pd.to_datetime(df['Time']).dt.hour
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['Day_of_week'] = pd.Categorical(df['Day_of_week'], categories=day_order, ordered=True)
df = df.sort_values('Day_of_week')
df.dropna(inplace=True)
df10 = df[['Area_accident_occured', 'Time']].value_counts().reset_index()
df10.columns = ['Area_accident_occured', 'Time', 'Count']
df10 = pd.merge(df10, df, on=['Area_accident_occured', 'Time'], how='left')
df10.sort_values('Area_accident_occured', inplace=True)
fig = px.scatter(df10, x="Time", y='Count', animation_frame="Day_of_week", animation_group="Area_accident_occured",
size="Count", color="Area_accident_occured", hover_name="Area_accident_occured",
labels={"Time": "Time of Day", "Count": "Accident Count", "Area_accident_occured": "Area of Accident"})
fig.show()
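The merge above attaches each count back to every original row, so the same (area, hour) point is repeated many times per animation frame. One alternative, sketched below on made-up rows, is to count directly per day, area, and hour. Passing `observed=True` makes the grouping ignore weekday categories that do not occur in the data. The toy values are assumptions; the column names and `day_order` come from the notebook.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Hypothetical rows; Time is already an hour of day, as after the conversion above
toy = pd.DataFrame({
    "Day_of_week": pd.Categorical(
        ["Monday", "Monday", "Sunday", "Monday"], categories=day_order, ordered=True
    ),
    "Area_accident_occured": ["Office areas", "Office areas", "Residential areas", "Office areas"],
    "Time": [17, 17, 1, 8],
})

# One row per (day, area, hour) combination that actually occurs
counts = (
    toy.groupby(["Day_of_week", "Area_accident_occured", "Time"], observed=True)
       .size()
       .reset_index(name="Count")
)

print(counts)
```

A frame of this shape has one marker per area and hour in each frame, which is the granularity the animated scatter expects.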